Home Credit Default Risk (HCDR)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle Competition. The goal of this project is to predict whether or not a client will repay a loan. In order to make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

Some of the challenges

  1. Dataset size
    • 688 MB compressed, with millions of rows of data
    • 2.71 GB of data uncompressed

Kaggle API setup

Kaggle is a Data Science competition platform which shares a lot of datasets. In the past, it was troublesome to submit your results, as you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to set up; it took me less than 15 minutes to finish a submission.

  1. Install library

For more detailed information on setting up the Kaggle API, see here and here.

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.

Home Credit Group has over 29 million customers, total assets of 21 billion Euro, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 19-05-2018).


Data files overview

There are 7 different sources of data:


Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click on the Download button on the following Data Webpage and unzip the zip file to DATA_DIR
  2. If you plan to use the Kaggle API, please use the following steps.

Imports

Data files overview

Data Dictionary

As part of the data download comes a Data Dictionary, named HomeCredit_columns_description.csv.


Application train

Application test

The application dataset has the most information about the client: Gender, income, family status, education ...

The Other datasets

Exploratory Data Analysis

Summary of Application train

Summary Statistics

Commentary

The descriptive statistics show that DAYS_BIRTH, DAYS_EMPLOYED, DAYS_REGISTRATION, and DAYS_ID_PUBLISH take negative values, which is not expected at first glance (they are recorded relative to the application date).

Missing data for application train

Distribution of the target column

Explore the distribution of values taken on by the target variable.

Days Employed

The number of days employed is an important feature for predicting risk. However, the histogram shows that some of the data is not plausible.

The histogram also shows a number of applications from clients with cars recorded as over 60 years old.

Commentary

The Application Train dataset contains most of the details with respect to loan requests. There are many missing values, which is a matter of concern, and we need to impute them. Occupation Type and Organization Type are categorical features with 18 and 58 categories, respectively. These can be useful in feature engineering.

Applicants Age

Applicants Occupation

Distribution of AMT_CREDIT

Visualize Income vs Loan Amount identified by default

Boxplot of AMT_CREDIT vs NAME_EDUCATION_TYPE with NAME_FAMILY_STATUS hue

Correlation with the target column

The distribution of the top correlated features are plotted below.

Density plots of correlated features are plotted below

Dataset questions

Unique record for each SK_ID_CURR

Previous applications for the submission file

The persons in the Kaggle submission file have had previous applications in previous_application.csv: 47,800 out of 48,744 people have had previous applications.

Histogram of Number of previous applications for an ID

Can we differentiate applicants by low, medium, and high numbers of previous applications?
* Low = fewer than 5 previous applications (22%)
* Medium = 10 to 39 previous applications (58%)
* High = 40 or more previous applications (20%)

Bureau Statistics

Statistics of Bureau Balance

Commentary

As we can see, Bureau Balance does not have any missing values. Bureau has some percentage of missing data, as plotted above. Bureau and Bureau Balance can be used to provide accurate aggregate features.

Statistics of Credit Card Balance

Statistics of Payment Installments

Statistics of POS_CASH_balance

Application Test Statistics

Joining secondary tables with the primary table

In the case of the HCDR competition (and many other machine learning problems that involve multiple tables, in 3NF or not), we need to join (denormalize) these datasets when using a machine learning pipeline. Joining the secondary tables with the primary table will lead to lots of new features about each loan application; these features will tend to be aggregate-type features or metadata about the loan or its application. How can we do this when using machine learning pipelines?

Joining previous_application with application_x

We refer to the application_train data (and application_test data) as the primary table and the other files as the secondary tables (e.g., the previous_application dataset). The secondary tables can be joined to the primary table using the key SK_ID_CURR (tables hanging off previous_application join via SK_ID_PREV).

Let's assume we wish to generate a feature based on previous application attempts. In this case, possible features here could be:

To build such features, we need to join the application_train data (and application_test data) with the previous_application dataset (and the other available datasets).
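For example, one such feature could be a simple count of prior applications per applicant. Here is a minimal sketch with toy data (frame contents are made up; the column names follow the HCDR schema, and PREV_APP_COUNT is an illustrative name):

```python
import pandas as pd

# Toy stand-ins for application_train and previous_application
app = pd.DataFrame({"SK_ID_CURR": [100001, 100002, 100003]})
prev = pd.DataFrame({"SK_ID_CURR": [100001, 100001, 100003],
                     "AMT_APPLICATION": [5000.0, 12000.0, 7500.0]})

# Count previous applications per current applicant, then left-join so that
# applicants with no history are kept (their count is imputed to 0)
prev_counts = (prev.groupby("SK_ID_CURR")
                   .size()
                   .rename("PREV_APP_COUNT")
                   .reset_index())
app = app.merge(prev_counts, on="SK_ID_CURR", how="left")
app["PREV_APP_COUNT"] = app["PREV_APP_COUNT"].fillna(0).astype(int)
```

The same aggregate-then-merge pattern extends to sums, means, and other statistics over the secondary table.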

When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:

  1. Preprocess each of the non-application datasets, thereby generating many new (derived) features, and then join (aka merge) the results with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train, valid, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]

I want you to think about this section and build on this.

Roadmap for secondary table processing

  1. Transform all the secondary tables to features that can be joined into the main table the application table (labeled and unlabeled)
    • 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
    • 'previous_application', 'POS_CASH_balance'
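The roadmap above amounts to a groupby-aggregate per secondary table, rolled up to one row per SK_ID_CURR. A hedged sketch with a toy bureau frame (values made up; the flattened column names are one possible convention):

```python
import pandas as pd

# Toy stand-in for the bureau table (one row per previous credit)
bureau = pd.DataFrame({
    "SK_ID_CURR": [100001, 100001, 100002],
    "AMT_CREDIT_SUM": [10000.0, 30000.0, 5000.0],
})

# Roll the secondary table up to one row per SK_ID_CURR with min/max/mean
agg = bureau.groupby("SK_ID_CURR").agg(["min", "max", "mean"])
agg.columns = ["_".join(c).upper() for c in agg.columns]  # flatten the MultiIndex
agg = agg.reset_index()
```

The resulting frame is join-ready: one row per applicant, one column per (feature, statistic) pair.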

agg detour

Aggregate using one or more operations over the specified axis.

For more details see agg

DataFrame.agg(func, axis=0, *args, **kwargs)
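A quick illustration of DataFrame.agg with a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# A list of operations applies each one down every column (axis=0, the default)
out = df.agg(["min", "max", "mean"])

# A dict applies a different operation per column
out2 = df.agg({"A": "sum", "B": "mean"})
```

`out` is a small frame indexed by the statistic names; `out2` is a Series with one value per column.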

Missing values in prevApps

feature engineering for prevApp table

feature transformer for prevApp table

Feature Engineering on Primary & Secondary Datasets

Merge each secondary dataset with the primary dataset's (application_train) target variable to understand the correlation between the target variable and the secondary dataset's features.

The following secondary datasets will be explored for correlation against the target variable.

Important features from Phase 1: 'AMT_ANNUITY', 'AMT_CREDIT_SUM','DAYS_CREDIT','AMT_CREDIT_SUM_OVERDUE','CREDIT_DAY_OVERDUE'

Important features from Phase 2: 'AMT_CREDIT_SUM','AMT_CREDIT_SUM_DEBT','AMT_CREDIT_SUM_LIMIT','AMT_CREDIT_MAX_OVERDUE'

Important features from Phase 1: 'MONTHS_BALANCE', 'AMT_BALANCE', 'CNT_INSTALMENT_MATURE_CUM','AMT_DRAWINGS_ATM_CURRENT' ,'AMT_INST_MIN_REGULARITY','AMT_PAYMENT_TOTAL_CURRENT'

Important features from Phase 2: 'CNT_DRAWINGS_ATM_CURRENT','AMT_CREDIT_LIMIT_ACTUAL','AMT_RECIVABLE', 'AMT_TOTAL_RECEIVABLE','AMT_RECEIVABLE_PRINCIPAL'

Important features from Phase 1: 'AMT_INSTALMENT', 'AMT_PAYMENT'

Important features from Phase 2: 'DAYS_ENTRY_PAYMENT','DAYS_INSTALMENT','NUM_INSTALMENT_VERSION'

Important features from Phase 1: 'AMT_ANNUITY', 'AMT_APPLICATION','AMT_DOWN_PAYMENT','CNT_PAYMENT','RATE_INTEREST_PRIVILEGED'

Important features from Phase 2: 'AMT_CREDIT','DAYS_FIRST_DRAWING','DAYS_LAST_DUE','HOUR_APPR_PROCESS_START','DAYS_FIRST_DUE'

Create Feature Aggregators

Engineer New Features

Engineer new features capturing relationship between income and credit amount as well as annuity and income for Application dataset

Engineer new features capturing range of annuity, application, and downpayment amounts from the Previous Application dataset

Engineer new features capturing range of annuity, application, and downpayment amounts from the Bureau dataset.
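The ratio-style engineered features described above can be sketched as follows (the AMT_* columns follow the HCDR schema; the ef_* feature names and toy values are illustrative):

```python
import pandas as pd

# Toy application rows standing in for application_train
app = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000.0, 200000.0],
    "AMT_CREDIT": [400000.0, 300000.0],
    "AMT_ANNUITY": [20000.0, 25000.0],
})

# Ratio features: credit burden and annuity burden relative to income
app["ef_CREDIT_INCOME_PCT"] = app["AMT_CREDIT"] / app["AMT_INCOME_TOTAL"]
app["ef_ANNUITY_INCOME_PCT"] = app["AMT_ANNUITY"] / app["AMT_INCOME_TOTAL"]
```

Range features over the Previous Application and Bureau tables follow the same pattern, subtracting the aggregated minimum from the aggregated maximum of a column.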

Build Pipeline for each Dataset

Prepare Datasets

Create Aggregate datasets after performing fit & transform

Join the labeled dataset

Perform data merging of primary application and secondary datasets.

Check presence of newly engineered features

Join the unlabeled dataset (i.e., the submission file)

Perform data merging of primary application and secondary datasets.

Processing pipeline

OHE when previously unseen unique values appear in the test/validation set

Train, validation, and test sets (and the leakage problem we mentioned previously):

Let's look at a small use case that shows how to deal with this:

This last problem can be solved by using the option handle_unknown='ignore' of the OneHotEncoder, which, as the name suggests, will ignore previously unseen values when transforming the test set.

Here is an example of that in action:

# Identify the categorical features we wish to consider.
cat_attribs = ['CODE_GENDER', 'FLAG_OWN_REALTY','FLAG_OWN_CAR','NAME_CONTRACT_TYPE', 
               'NAME_EDUCATION_TYPE','OCCUPATION_TYPE','NAME_INCOME_TYPE']

# Notice handle_unknown="ignore" in the OHE, which ignores values from the validation/test set
# that do NOT occur in the training set
cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(sparse=False, handle_unknown="ignore"))  # use sparse_output=False on scikit-learn >= 1.2
    ])
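To see the handle_unknown="ignore" behavior in isolation, here is a small standalone demo (the contract-type values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(np.array([["Cash loans"], ["Revolving loans"]]))  # training categories

# "XNA" was never seen during fit; with handle_unknown="ignore"
# it encodes as an all-zero row instead of raising an error
row = enc.transform(np.array([["XNA"]])).toarray()
```

With the default handle_unknown='error', the same transform call would raise a ValueError.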

Please see this blog for more details on OHE when the validation/test set has previously unseen unique values.

HCDR preprocessing

Dataframe Column Selector

Numerical Pipeline Set-up

Categorical Attributes

Create Consolidated Data Pipeline

Use ColumnTransformer instead of FeatureUnion

Summarize Features Considered and Lengths

Feature Engineering

Baseline Model

To get a baseline, we use some of the features after preprocessing them through the pipeline. The baseline model is a logistic regression model.
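A minimal sketch of such a baseline, using synthetic data in place of the preprocessed HCDR features (the class imbalance roughly mimics the ~8% default rate):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed feature matrix
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.92],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

baseline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
baseline.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1])
```

Scoring with predicted probabilities (not hard labels) is what the competition's ROC AUC metric expects.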

Split Application into Train-Test split

Define Pipeline

Perform cross-fold validation and Train the model

Split the training data into 10 folds to perform cross-fold validation.
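The 10-fold cross-validation step can be sketched as follows (synthetic data stands in for the real features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Stratified folds keep the class ratio stable across splits,
# which matters for an imbalanced target like HCDR's
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="roc_auc")
```

`scores` holds one ROC AUC value per fold; its mean and spread summarize the model's stability.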

Evaluation metrics

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

The scikit-learn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, also denoted AUC or AUROC. By computing the area under the ROC curve, the curve information is summarized in one number.

>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75

Set-up Evaluation Metrics Variables

Confusion Matrix

AUC Curve

Tune Baseline Model Parameters with GridSearch

The baseline Logistic Regression model was tuned across different parameters and evaluated for the following metrics:

Import necessary libraries to determine feature importance for different classifiers:

Logistic Regression

Naive Bayes

Gradient Boosting

XGBoost

Decision Trees

Random Forest

Final Results

Feature Importance - Logistic Regression, Decision Tree and Random Forest

Set up a function to be used for the Gradient Boosting and Decision Tree models. Logistic regression has slightly different logic (coded in the section immediately below), so it does not use this function.

Logistic Regression

Gradient Boosting

XGBoost

Decision Tree

Kaggle submission via the command line API

Submission File Prep

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
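Assuming a fitted model's predicted probabilities are available, the submission file can be assembled like this (the IDs and probabilities below are the example values above):

```python
import pandas as pd

# Hypothetical predicted probabilities for the test-set IDs
submission = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005, 100013],
    "TARGET": [0.1, 0.9, 0.2],
})

# index=False keeps the two-column header/row layout Kaggle expects;
# in practice you would pass a file path instead of capturing the string
csv_text = submission.to_csv(index=False)
```

Writing `submission.to_csv("submission.csv", index=False)` produces the file to upload via the Kaggle CLI.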

report submission

Click on this link

Write-up

Abstract

The HCDR project aims to create a machine learning model that can accurately predict whether a customer will default on loan repayment. In Phase 1, we developed a baseline logistic regression model that achieved a ROC_AUC score of 0.74306.

In Phase 2, we wanted to improve our performance with new features and evaluate other algorithms. We engineered additional features and performed Grid Search with six classification algorithms to tune hyperparameters. XGBoost performed the best, with the highest test accuracy of 91.90%, an AUC of 76.19%, and better precision and recall scores. Gradient Boosting came very close on accuracy and AUC but slightly underperformed XGBoost in precision and recall. Naive Bayes performed the worst among all models, with the lowest accuracy at 19.5% and the highest log loss at 27.8. Decision Trees and Random Forest performed no better than the baseline.

Our best ROC_AUC score for Kaggle submission was 0.74779.

Project Description

Home Credit is an international non-bank financial institution that aims to lend people money regardless of their credit history. Home Credit Group focuses on providing a positive borrowing experience for customers who do not rely on traditional banking. Thus, Home Credit Group published a dataset on Kaggle with the goal of identifying and solving unfair loan rejection.

The purpose of this project is to create a machine learning model that can accurately predict customer behavior on repayment of the loan. Our task is to form a pipeline to build a baseline machine learning model using the logistic regression classification algorithm. The final model will be evaluated using a number of different performance metrics that we can use to create a better model. Businesses can use this model to identify whether a loan is at risk of default. The new model will ensure that clients who are capable of repaying their loans are not rejected and that loans are given with a principal, maturity, and repayment calendar that will allow their clients to be successful.

The results of the machine learning pipelines are measured by using these metrics: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Accuracy Score, Precision, Recall, Confusion Matrix, and Area Under ROC Curve (AUC).

The results of our pipelines will be analyzed and ranked. The most efficient pipeline will be submitted to the Kaggle competition for the Home Credit Default Risk (HCDR).

Workflow

We are implementing the workflow outlined below. In Phase 0, we understood the project modelling requirements and outlined our plans. In Phase 1, we perform the first of three iterations of the remainder of the workflow.

Data Description

The dataset contains 1 primary table and 6 secondary tables.

Primary Tables

  1. application_train
     This primary table includes the application information for each loan application at Home Credit, one per row. Each row includes the target variable of whether or not the loan was repaid; we use this field as the basis to determine feature importance. The target variable is binary, since this is a classification problem. It takes on two different values:

    • '1' - client with payment difficulties: he/she had a late payment of more than N days on at least one of the first M installments of the loan in our sample
    • '0' - all other cases

     There are 122 variables and 307,511 data entries.
  2. application_test
     This table includes the application information for each loan application at Home Credit, one per row. The features are the same as in the train data but exclude the target variable.

     There are 121 variables and 48,744 data entries.

Secondary Tables

  1. Bureau
     This table includes all previous credits received by a customer from other financial institutions prior to their loan application. There is one row for each previous credit, meaning a many-to-one relationship with the primary table. We can join it with the primary table using the current application ID, SK_ID_CURR.

     There are 17 variables and 1,716,428 data entries.

  2. Bureau Balance
     This table includes the monthly balance for a previous credit at other financial institutions. There is one row for each monthly balance, meaning a many-to-one relationship with the Bureau table. We can join it with the Bureau table using the bureau ID, SK_ID_BUREAU.

     There are 3 variables and 27,299,925 data entries.

  3. Previous Application
     This table includes previous applications for loans made by the customer at Home Credit. There is one row for each previous application, meaning a many-to-one relationship with the primary table. We can join it with the primary table using the current application ID, SK_ID_CURR. There are four types of contracts: a. Consumer loan (POS - credit limit given to buy consumer goods) b. Cash loan (client is given cash) c. Revolving loan (credit) d. XNA (contract type without values)

     There are 37 variables and 1,670,214 data entries.

  4. POS CASH Balance
     This table includes a monthly balance snapshot of a previous point-of-sale or cash loan that the customer has at Home Credit. There is one row for each monthly balance, meaning a many-to-one relationship with the Previous Application table. We join it with the Previous Application table using the previous application ID, SK_ID_PREV, then join it with the primary table using the current application ID, SK_ID_CURR.

     There are 8 variables and 10,001,358 data entries.

  5. Credit Card Balance
     This table includes a monthly balance snapshot of previous credit cards the customer has with Home Credit. There is one row for each previous monthly balance, meaning a many-to-one relationship with the Previous Application table. We join it with the Previous Application table using the previous application ID, SK_ID_PREV, then join it with the primary table using the current application ID, SK_ID_CURR.

     There are 23 variables and 3,840,312 data entries.

  6. Installments Payments
     This table includes previous repayments made or not made by the customer on credits issued by Home Credit. There is one row for each payment or missed payment, meaning a many-to-one relationship with the Previous Application table. We join it with the Previous Application table using the previous application ID, SK_ID_PREV, then join it with the primary table using the current application ID, SK_ID_CURR.

     There are 8 variables and 13,605,401 data entries.

Data Tasks

The following data preprocessing tasks need to be achieved to prepare the datasets after downloading and unzipping the main application and secondary datasets:

  1. Analyze missing values from application_train table and feature correlations with target variable.
  2. Examine correlations between primary dataset's target variable and features from each secondary dataset.
  3. Create pipelines for primary and secondary datasets that generate minimum, maximum, and mean metrics using aggregate functions.
  4. Transform for primary and secondary datasets using the pipelines.
  5. Perform feature engineering to build new features for the previous_application dataset.
  6. Join the primary application dataset (labeled train and unlabeled test) with secondary tables on SK_ID_CURR. A left join is used so that any loan application record IDs that are missing secondary data are not dropped and will instead be imputed (strategy discussed in pipeline).
  7. Engineer new features around claim duration attributes and Occupation Type
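Step 6's left join can be illustrated with toy frames (column names follow the HCDR schema; the values are made up):

```python
import pandas as pd

app = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
bureau_agg = pd.DataFrame({"SK_ID_CURR": [1, 3],
                           "AMT_CREDIT_SUM_MEAN": [20000.0, 5000.0]})

# Left join: applicant 2 has no bureau history but must not be dropped;
# its missing aggregate is left as NaN for the imputer in the pipeline
merged = app.merge(bureau_agg, on="SK_ID_CURR", how="left")
```

An inner join here would silently discard applicant 2, shrinking the labeled training set.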

EDA

For the Exploratory Data Analysis component of this phase, we did a precursor analysis on the data to ensure that our results would be accurate.

We looked at summary statistics for each table in the model. We primarily focused on the data distribution, identifying statistics such as the count, mean, standard deviation, minimum, IQR, and maximum.

We also looked at specific numerical and categorical features and visualized them. We created a heatmap to identify the correlation between each feature and the target variable. We also visualized the age, occupation, and distribution of credit amounts.

Please see the Exploratory Data Analysis section for our complete EDA.

Feature Engineering and transformers

In our feature engineering process, we created two types of features to enhance our dataset. First, we created new aggregate features to capture the minimum, maximum, and mean of numerical attributes across the primary and secondary datasets that were highly correlated with the target variable.

In Phase 2, we decided to engineer the following new features from the Application and Bureau datasets:

Similar to Phase 1, we identified the highly correlated features by creating a simple function that took a secondary dataframe name as an input variable and generated a correlation matrix between all the features in the inputted dataframe and the primary dataset's target variable.

All the aggregate values were calculated from the original dataframes, and a new set of dataframes (comprising the primary and secondary datasets) was generated. After the secondary datasets were merged with the primary application_train dataset, the new consolidated application training dataframe had a total of 240 features (including the aggregate calculations for specific features).

Further, the top highly correlated features (positive and negative) were chosen from both the primary and secondary datasets. These features were then classified into numerical and categorical variables to form inputs for 2 individual pipelines. In total, our baseline model comprised 91 features (84 numerical and 7 categorical).

(Please see Feature Engineering section and Feature Aggregator for more details)

Pipelines

In Phase 1, we implemented Logistic Regression as a starting baseline model due to its easy implementation and low computational requirements. We used 5-fold cross-validation along with hyperparameter tuning via Scikit-learn's GridSearchCV function.

Here is the high-level workflow for the model pipeline followed by detailed steps:

The rationale for the other classifier models are listed below:

We retained many of the data preprocessing procedures and data pipeline skeletal code from Phase 1. We augmented our feature engineering steps to build new features and developed a Grid Search function to tune hyperparameters and determine evaluation metrics for each classifier algorithm listed above.

  1. Download data and perform data pre-processing tasks (joining primary and secondary datasets, transformation)
  2. Create a data pipeline using ColumnTransformer to combine highly correlated numerical and categorical features. Impute missing numerical attributes with mean values and categorical attributes with the most frequent values.
  3. Create model with data pipeline and baseline model to fit training dataset
  4. Perform Grid Search on each classifier and generate evaluation metrics (accuracy score, AUC score, Log Loss, RMSE, MAE, and p-value), confusion matrix, precision recall plot, and ROC curve plots for train, validation and test datasets.
  5. Use above results to find the best performing model and submit to Kaggle.

Experimental results

Here are the experiment results for our baseline Logistic Regression model and the five other classification algorithms we fine-tuned. The RMSE and MAE scores are not included in the image below but can be found in the Experiment Results section.

Furthermore, we analyzed the feature importances of the Logistic Regression, Gradient Boosting, and Decision Tree models. Though XGBoost had the best overall performance in terms of accuracy, AUC, precision, and recall, we couldn't produce a chart showing its feature importance because our kernel took too long, so we chose to analyze the feature importance scores from Gradient Boosting, which performed very close to XGBoost:

From the feature importance chart above, the external source scores (EXT_SOURCE_3, EXT_SOURCE_2, EXT_SOURCE_1) followed by DAYS_BIRTH and DAYS_CREDIT (from bureau dataset) are the most predictive features of the target variable.

Evaluation Metrics

Since HCDR is a Classification task, we used the following metrics to measure the Model performance.

MAE

The mean absolute error is the average of the absolute values of individual prediction errors over all instances in the test set. Each prediction error is the difference between the true value and the predicted value for the instance.

$$ \text{MAE}(\mathbf{X}, h_{\mathbf{\theta}}) = \dfrac{1}{m} \sum\limits_{i=1}^{m}{| \mathbf{x}^{(i)}\cdot \mathbf{\theta} - y^{(i)}|} $$

RMSE

The root mean square error is the normalized distance between the vector of predicted values and the vector of observed values. First, the squared difference between each observed value and predicted value is calculated; RMSE is the square root of the mean of these squared differences.

$$ \text{RMSE}(\mathbf{X}, h_{\mathbf{\theta}}) = \sqrt{\dfrac{1}{m} \sum\limits_{i=1}^{m}{( \mathbf{x}^{(i)}\cdot \mathbf{\theta} - y^{(i)})^2}} $$
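Both metrics are one-liners in NumPy; using the same toy labels and scores as the ROC AUC example:

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])

# MAE: mean of absolute prediction errors
mae = np.mean(np.abs(y_pred - y_true))
# RMSE: square root of the mean squared prediction error
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
```

Because the errors are squared before averaging, RMSE penalizes the single large error (0.65 on the third instance) more heavily than MAE does.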

Accuracy Score

This metric describes the fraction of correctly classified samples. In scikit-learn, it can be modified to return solely the number of correct samples. Accuracy is the default scoring method for both logistic regression and k-Nearest Neighbors in scikit-learn.

Precision

The precision is the ratio of true positives over the total number of predicted positives.

Recall

Recall is the ratio of true positives over the sum of true positives and false negatives. It assesses the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0.

Confusion Matrix

The confusion matrix, in this case for a binary classification, is a 2x2 matrix that contains the count of the true positives, false positives, true negatives, and false negatives.
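For example, with scikit-learn's confusion_matrix on toy labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
```

Here two negatives and two positives are classified correctly, with one false positive and one false negative.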

AUC (Area under ROC curve)

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters: ▪ True Positive Rate ▪ False Positive Rate

AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve from (0,0) to (1,1).

AUC is desirable for the following two reasons:

  1. AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
  2. AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.

Binary cross-entropy loss (CXE)

Binary cross-entropy loss (CXE) measures the performance of a classification model as a probability value between 0 and 1. It increases as the predicted probability diverges from the actual label. Therefore, the objective function would need to minimize the binary CXE loss function.

The log loss formula for the binary case is as follows :

$$ -\frac{1}{m}\sum^m_{i=1}\left(y_i\cdot\:\log\:\left(p_i\right)\:+\:\left(1-y_i\right)\cdot\log\left(1-p_i\right)\right) $$
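The formula can be checked against scikit-learn's log_loss on the toy labels and scores used earlier:

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([0, 0, 1, 1])
p = np.array([0.1, 0.4, 0.35, 0.8])

# Manual binary cross-entropy, term by term from the formula above
manual = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
assert np.isclose(manual, log_loss(y, p))
```

The loss is dominated by the poorly calibrated third instance (true label 1, predicted probability 0.35).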

p-value

p-value is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct. A very small p-value means that such an extreme observed outcome would be very unlikely under the null hypothesis.

We will compare the classifiers with the untuned baseline model by conducting a two-tailed hypothesis test.

Null Hypothesis, H0: There is no significant difference between the two machine learning pipelines.

Alternate Hypothesis, HA: The two machine learning pipelines are different.

A p-value less than or equal to the significance level is considered statistically significant.
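A sketch of such a comparison on hypothetical per-fold AUC scores, using SciPy's paired t-test (the scores below are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold AUC scores for the baseline and a candidate pipeline,
# computed on the SAME cross-validation folds (hence a paired test)
baseline_auc = np.array([0.740, 0.742, 0.739, 0.744, 0.741])
candidate_auc = np.array([0.758, 0.761, 0.755, 0.762, 0.759])

# Two-tailed paired t-test on the fold-wise score differences
t_stat, p_value = stats.ttest_rel(candidate_auc, baseline_auc)
```

A p-value at or below the chosen significance level (e.g., 0.05) rejects H0, indicating the pipelines genuinely differ.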

Discussion

We started our experimentation with our Phase 1 baseline Logistic Regression model, but with additional features from different datasets. Based on the results above, we obtained high accuracy scores, as in Phase 1, at around 91.9%, while our AUC values stayed around 75%. Our train-data precision score was 50% while the recall score stood at 2%. When we evaluated our baseline model with the best hyperparameters, we did not observe significant improvement in our evaluation metrics.

When we ran the other classification algorithms, we found that XGBoost was the best model, achieving a higher test accuracy score of 92%, a test AUC of 76%, and better precision and recall scores. Gradient Boosting came very close in terms of accuracy and AUC but slightly underperformed XGBoost in precision and recall. Decision Tree and Random Forest (the latter being an ensemble method) did not achieve much improvement relative to our baseline model. In fact, Decision Tree did not achieve statistical significance based on the p-value (0.85).

Our worst performing model was Naive Bayes, with a very low accuracy score hovering at 19.5%. We believe this has to do with the intrinsic nature of NB, which operates on conditional and unconditional probabilities associated with features rather than on feature weights. Another factor to consider is the presence of features that are not normally distributed.

From our Feature Importance analysis, we found that the external scores play a significant predictive role in determining risk of default. Features from the Bureau dataset and our engineered features like 'ef_ANNUAL_INCOME_PCT' enhanced our model performance.

For our Kaggle submission, we intended to use XGBoost with the best parameters, since its test accuracy was the best among all algorithms (though, as noted under Challenges, we ultimately had to submit the Gradient Boosting results).

Conclusion

In the Home Credit Default Risk (HCDR) project, we are using Home Credit’s data to better predict loan repayment by customers with little to no credit history. In Phase 1, we developed a baseline logistic regression algorithm.

In Phase 2, we engineered new features from the bureau datasets. We performed Grid Search on six different models: Logistic Regression, Naive Bayes, Gradient Boosting, XGBoost, Decision Trees, and Random Forest. Our best performing model was XGBoost with a test accuracy of 91.90% and AUC ROC score of 76.19%. All the other models had lower results, but Gradient Boosting came very close with a test AUC_ROC score of 75.80%. The worst performing model was Naive Bayes. The ROC_AUC score for our Phase 2 Kaggle submission was 0.74779 (from Gradient Boosting), an improvement over our Phase 1 score of 0.74306.

In Phase 3, we plan to examine our feature engineering process and determine if we can increase our Kaggle AUC score with fewer features. In order to circumvent the technical challenges, we will attempt to implement PyTorch along with SVM and other models using IU Red resources.

Challenges

The challenges we faced in this phase were a continuation of those we experienced in Phase 1. We had to think hard about designing relevant features that would prove useful. As we engineered new features, we had to troubleshoot errors related to invalid calculations such as divide by zero errors. We had to constantly remind ourselves to follow the sequence of performing aggregate calculations and then engineering new features on top of them (and not the other way around). This meant we needed to be specific on which aggregate feature calculations we wanted to engineer new features from.

Our team was not able to implement the Support Vector Machine classifier successfully. All four of us tried and ended up crashing our Jupyter kernels (despite increasing our resources in Docker). This was a major roadblock, as we wanted to compare another non-ensemble model like SVM against Logistic Regression. In addition, we could not re-implement XGBoost for the Kaggle submission, so we had to submit our results from the Gradient Boosting model.

Along the way, we faced several technical issues in developing this notebook:

Kaggle Submission

Below is a screenshot of our best Kaggle submission.

References

Some of the material in this notebook has been adapted from here.

We referred to the following resources to understand the algorithms and hyperparameters to modify:

TODO: Predicting Loan Repayment with Automated Feature Engineering in Featuretools

Read the following: